Download course materials from here.
bit.ly/Leeds_2019-02-28
install.packages(c("tidyverse", "ordinal", "lme4"))
Dataset
sleep:
What does this do?
sleep[1,]
How does this differ from [1,]?
How does this differ from [3,]?
sleep[,3] # what do you think this is?
[1] 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7
[18] 8 9 10
Levels: 1 2 3 4 5 6 7 8 9 10
…which makes it dataset[row,column]
You can also navigate with column names:
sleep$ID
[1] 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7
[18] 8 9 10
Levels: 1 2 3 4 5 6 7 8 9 10
How would you view the column extra?
sleep$extra
[1] 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0.0 2.0
[11] 1.9 0.8 1.1 0.1 -0.1 4.4 5.5 1.6 4.6 3.4
Use str() to get a summary of the structure of the dataset
str(sleep)
'data.frame': 20 obs. of 3 variables:
$ extra: num 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
$ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ ID : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
What are all the unique values in extra?
unique(sleep$extra)
[1] 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0.0 2.0
[11] 1.9 1.1 0.1 4.4 5.5 1.6 4.6
What’s the value in the first row, third column? And what’s the first element in the column ID? Both questions point at the same cell:
sleep$ID[1]
[1] 1
Levels: 1 2 3 4 5 6 7 8 9 10
You can also view the dataset as a spreadsheet (although it can’t be altered).
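In RStudio, View() (capital V) opens the dataset in a read-only spreadsheet-style viewer:

```r
# Opens sleep in RStudio's data viewer; cells cannot be edited there
View(sleep)
```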
library(tidyverse)
── Attaching packages ─────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.0     ✔ purrr   0.3.0
✔ tibble  2.0.1     ✔ dplyr   0.8.0.1
✔ tidyr   0.8.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ──────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
A tibble is the tidyverse’s version of a data frame: the same data, but it prints more compactly and informatively.
Pipes are like toy funnels
%>%
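The pipe passes the result on its left into the function on its right as its first argument, so nested calls read left to right. For example, these two lines are equivalent:

```r
library(tidyverse)

head(sleep)       # standard nested call
sleep %>% head()  # piped: sleep becomes head()'s first argument
```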
How many attestations of each type of group?
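One way to answer this is with dplyr’s count():

```r
library(tidyverse)

# Tally the number of rows at each level of group
sleep %>%
  count(group)
# each of the two groups has 10 rows
```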
How can you make a new column? Duplicate group into group2:
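mutate() creates new columns; duplicating an existing one is the simplest case:

```r
library(tidyverse)

# group2 is just a copy of group
sleep %>%
  mutate(group2 = group)
```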
Let’s rename the levels in group to be “one” and “two”:
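One working approach: let case_when() return plain character strings, and convert to a factor once at the end:

```r
library(tidyverse)

sleep %>%
  mutate(group2 = factor(case_when(group == 1 ~ "one",
                                   group == 2 ~ "two")))
```

forcats offers a more direct route for recoding factor levels: fct_recode(sleep$group, one = "1", two = "2").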
What’s wrong with this one?
sleep %>%
  mutate(group2 = case_when(group == 1 ~ as.factor("one"),
                            group == 2 ~ as.factor("two")))
Warning: invalid factor level, NA generated
Now, let’s recreate the count function with group_by and summarise:
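count(group) is shorthand for grouping by that column and counting rows per group:

```r
library(tidyverse)

# Equivalent to sleep %>% count(group)
sleep %>%
  group_by(group) %>%
  summarise(n = n())
```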
We can use group_by and summarise to do a lot more than just count:
Dataset
quakes:
What does quakes look like?
quakes %>%
str
'data.frame': 1000 obs. of 5 variables:
$ lat : num -20.4 -20.6 -26 -18 -20.4 ...
$ long : num 182 181 184 182 182 ...
$ depth : int 562 650 42 626 649 195 82 194 211 622 ...
$ mag : num 4.8 4.2 5.4 4.1 4 4 4.8 4.4 4.7 4.3 ...
$ stations: int 41 15 43 19 11 12 43 15 35 19 ...
How many observations are there per “level” of magnitude?
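Since mag takes a small set of discrete values, count() answers this directly:

```r
library(tidyverse)

# One row per magnitude value, with the number of quakes at each
quakes %>%
  count(mag)
```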
Let’s create a table of the means, standard deviations, and standard errors for both stations reporting and depths:
This dataset is wide. Let’s make it long using gather (compare to spread):
quakes %>%
group_by(mag) %>%
summarise(n = n(),
stationMean = mean(stations),
stationSD = sd(stations),
stationSE = sd(stations)/sqrt(n),
depthMean = mean(depth),
depthSD = sd(depth),
depthSE = sd(depth)/sqrt(n)) %>%
gather(measurement, values, 3:8) #-> quakesLong
install.packages("ggbeeswarm")
Dataset
iris:
The function ggplot layers different geometries and aesthetics to build up a plot:
In what ways might we change this plot?
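As a minimal sketch with iris (the exact geometries and aesthetics used in the workshop plot may differ):

```r
library(tidyverse)

# Layers build up: data + aesthetic mappings, then a geometry, then a theme
ggplot(iris, aes(x = Species, y = Sepal.Length, colour = Species)) +
  geom_boxplot() +
  theme_bw()
```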
Going back to quakes, let’s pipe our table into a ggplot (and fill in missing values):
quakes %>%
group_by(mag) %>%
summarise(n=n(),
sta=mean(stations),
staSD=sd(stations),
staSE=staSD/sqrt(n),
dep=mean(depth),
depSD=sd(depth),
depSE=depSD/sqrt(n)) %>%
ggplot(aes(x=mag)) +
geom_point(aes(y=sta),colour="red")+
geom_path(aes(y=sta),colour="red")+
geom_ribbon(aes(ymin=sta-staSD,ymax=sta+staSD),fill="red",alpha=.2) +
geom_ribbon(aes(ymin=sta-staSE,ymax=sta+staSE),fill="red",alpha=.5) +
geom_point(aes(y=dep),colour="blue")+
geom_path(aes(y=dep),colour="blue")+
geom_ribbon(aes(ymin=dep-depSD,ymax=dep+depSD),fill="blue",alpha=.2) +
geom_ribbon(aes(ymin=dep-depSE,ymax=dep+depSE),fill="blue",alpha=.5) +
theme_bw() + ylab("depth OR number of stations reporting")
Inheritance:
See Simulated Data notebook
data <- read_csv("../data/simulated-data.csv")
Parsed with column specification:
cols(
  subj = col_double(),
  age = col_double(),
  item = col_double(),
  freq = col_character(),
  gram = col_character(),
  rating = col_double(),
  accuracy = col_double(),
  region = col_double(),
  word = col_character(),
  rt = col_double()
)
Plot boxplots and violin plots for the ratings. Subset by participant.
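A sketch of one way to do this (assuming the ratings live in the rating column and participants are identified by subj):

```r
library(tidyverse)

data <- read_csv("../data/simulated-data.csv")

# One violin per participant, with a narrower boxplot layered inside it
data %>%
  ggplot(aes(x = as.factor(subj), y = rating)) +
  geom_violin() +
  geom_boxplot(width = .2)
```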
Group by region, word, frequency, and grammaticality. Summarise mean and standard error.
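A sketch of the summary table, reusing the SE pattern from the quakes example (assuming reading time, rt, is the measure being summarised):

```r
library(tidyverse)

data <- read_csv("../data/simulated-data.csv")

# Mean and standard error of reading times per design cell
data %>%
  group_by(region, word, freq, gram) %>%
  summarise(n = n(),
            rtMean = mean(rt),
            rtSE = sd(rt) / sqrt(n))
```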
Well, check out the Datasaurus Dozen, which all have the same x/y means and standard deviations:
library(lme4)
Loading required package: Matrix
Attaching package: ‘Matrix’
The following object is masked from ‘package:tidyr’:
expand
Build a simple linear model to examine region 3 (the verb):
# summary(
# lm ( DV ~ IV1 * IV2 + IV3 , data = data )
# )
summary(lm(rt ~ freq * gram + age, data[data$region==3,]))
Call:
lm(formula = rt ~ freq * gram + age, data = data[data$region ==
3, ])
Residuals:
Min 1Q Median 3Q Max
-266.663 -56.801 3.386 52.437 271.658
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 321.2832 12.1092 26.532 < 2e-16 ***
freqlow 31.2062 8.0033 3.899 0.000105 ***
gramyes -1.9600 8.0033 -0.245 0.806600
age 1.2848 0.2503 5.134 3.58e-07 ***
freqlow:gramyes 1.5559 11.3184 0.137 0.890697
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 80.03 on 795 degrees of freedom
Multiple R-squared: 0.06839, Adjusted R-squared: 0.0637
F-statistic: 14.59 on 4 and 795 DF, p-value: 1.663e-11
Add in mixed effects for a linear mixed effects model:
#summary(
# lmer( DV ~ IV1 * IV2 + IV3 + ( 1 | RV1 ) + ( 1 | RV2 ) , data = data)
# )
summary(lmer(rt ~ freq+gram+freq:gram+age + (1|subj) + (1|item),data %>% filter(region==3)))
Linear mixed model fit by REML ['lmerMod']
Formula:
rt ~ freq + gram + freq:gram + age + (1 | subj) + (1 | item)
Data: data %>% filter(region == 3)
REML criterion at convergence: 9254.5
Scaled residuals:
Min 1Q Median 3Q Max
-3.1961 -0.7094 0.0332 0.6754 3.3830
Random effects:
Groups Name Variance Std.Dev.
subj (Intercept) 129.944 11.399
item (Intercept) 8.753 2.958
Residual 6273.993 79.209
Number of obs: 800, groups: subj, 40; item, 20
Fixed effects:
Estimate Std. Error t value
(Intercept) 321.2832 13.9689 23.000
freqlow 31.2062 8.1389 3.834
gramyes -1.9600 8.1389 -0.241
age 1.2848 0.2946 4.362
freqlow:gramyes 1.5559 11.5101 0.135
Correlation of Fixed Effects:
(Intr) freqlw gramys age
freqlow -0.291
gramyes -0.291 0.500
age -0.902 0.000 0.000
frqlw:grmys 0.206 -0.707 -0.707 0.000
Bodo Winter telling it like it is. (Photo credit: Adam Schembri 27/02/2019)
We should do model comparison to assess the contribution of each factor to the overall fit. But read the Bates et al. and Barr et al. papers for an overview of the debates around how to design and test models.
Let’s do model comparison for region 3:
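The two models being compared were fit along these lines (their formulas and data are recoverable from the anova output below):

```r
library(lme4)

# Model without grammaticality
mdl1.grm <- lmer(rt ~ freq + age + accuracy + (1 | subj),
                 data[data$region == 3, ])
# Model with grammaticality added
mdl1.int <- lmer(rt ~ freq + gram + age + accuracy + (1 | subj),
                 data[data$region == 3, ])
```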
anova(mdl1.int,mdl1.grm) # int vs grm
refitting model(s) with ML (instead of REML)
Data: data[data$region == 3, ]
Models:
mdl1.grm: rt ~ freq + age + accuracy + (1 | subj)
mdl1.int: rt ~ freq + gram + age + accuracy + (1 | subj)
Df AIC BIC logLik deviance Chisq Chi Df
mdl1.grm 6 9287.4 9315.5 -4637.7 9275.4
mdl1.int 7 9289.4 9322.1 -4637.7 9275.4 0.0451 1
Pr(>Chisq)
mdl1.grm
mdl1.int 0.8318
How do regions 4 and 5 compare?
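A sketch for region 4 (region 5 works the same way); the model pair mirrors the region 3 comparison:

```r
library(lme4)

mdl4.grm <- lmer(rt ~ freq + age + accuracy + (1 | subj),
                 data[data$region == 4, ])
mdl4.int <- lmer(rt ~ freq + gram + age + accuracy + (1 | subj),
                 data[data$region == 4, ])
anova(mdl4.int, mdl4.grm)
```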
First, how could we go about using lmer for rating data?
mdl.ord <- lmer(rating ~ freq*gram + (1|subj), data%>%filter(region==1))
summary(mdl.ord)
Linear mixed model fit by REML ['lmerMod']
Formula: rating ~ freq * gram + (1 | subj)
Data: data %>% filter(region == 1)
REML criterion at convergence: 2376.8
Scaled residuals:
Min 1Q Median 3Q Max
-3.04800 -0.83352 -0.04503 0.75865 2.94849
Random effects:
Groups Name Variance Std.Dev.
subj (Intercept) 0.01176 0.1085
Residual 1.11860 1.0576
Number of obs: 800, groups: subj, 40
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.39500 0.07673 31.215
freqlow -0.47000 0.10576 -4.444
gramyes 1.82000 0.10576 17.208
freqlow:gramyes 0.32000 0.14957 2.139
Correlation of Fixed Effects:
(Intr) freqlw gramys
freqlow -0.689
gramyes -0.689 0.500
frqlw:grmys 0.487 -0.707 -0.707
library(ordinal)
Attaching package: ‘ordinal’
The following objects are masked from ‘package:lme4’:
ranef, VarCorr
The following object is masked from ‘package:dplyr’:
slice
For the ratings, build models like above, but using clmm() (these take a little longer to run):
mdl.ord.clmm <- clmm(as.factor(rating) ~ freq*gram + age + (1|subj), data%>%filter(region==1))
summary(mdl.ord.clmm)
Cumulative Link Mixed Model fitted with the Laplace approximation
formula: as.factor(rating) ~ freq * gram + age + (1 | subj)
data: data %>% filter(region == 1)
Random effects:
Groups Name Variance Std.Dev.
subj (Intercept) 0.02675 0.1636
Number of groups: subj 40
Coefficients:
Estimate Std. Error z value Pr(>|z|)
freqlow -0.928986 0.186594 -4.979 6.4e-07 ***
gramyes 2.846015 0.207608 13.709 < 2e-16 ***
age -0.007218 0.006135 -1.177 0.2394
freqlow:gramyes 0.674074 0.264725 2.546 0.0109 *
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Threshold coefficients:
Estimate Std. Error z value
1|2 -1.35904 0.30108 -4.514
2|3 -0.06295 0.29158 -0.216
3|4 1.14857 0.29870 3.845
4|5 2.52700 0.31329 8.066
See visualising_ordinal_data.R for Predicted Curves script
data %>%
filter(region==1) %>%
mutate(age_group = case_when(age<35~"young",
age>=35&age<=55~"middle",
age>55~"old")) %>%
mutate(age_group = factor(age_group,levels=c("young","middle","old")))%>%
group_by(freq,gram,age_group) %>%
summarise(accuracy=sum(accuracy)/n()) %>%
ggplot(aes(x=freq,fill=gram,y=accuracy))+
geom_bar(stat="identity",position="dodge")+
facet_wrap(~age_group)
Does accuracy change as a function of age?
# summary(
# glmer( DV ~ IV1 * IV2 + IV3 + ( 1 | RV1 ) + ( 1 | RV2 ), family="binomial", data = data)
# )
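Filling in the template above for this dataset (a sketch; the exact fixed-effects structure is open for discussion, and region 1 is assumed here to match the accuracy plot above):

```r
library(lme4)
library(tidyverse)

# Logistic mixed model: accuracy is 0/1, so family = "binomial"
mdl.acc <- glmer(accuracy ~ freq * gram + age + (1 | subj) + (1 | item),
                 family = "binomial",
                 data = data %>% filter(region == 1))
summary(mdl.acc)
```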
Do model comparison to assess the contribution of each of the factors to the overall fit: